{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\nSimple Matrix Multiply\n======================\n**Author**: `Thierry Moreau <https://homes.cs.washington.edu/~moreau/>`_\n\nIn this tutorial, we will build on top of the `vta-get-started` tutorial\nand introduce additional concepts required to implement matrix multiplication\non VTA with the TVM workflow.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "RPC Setup\n---------\nWe start by programming the Pynq's FPGA and building its RPC runtime\nas we did in the VTA introductory tutorial.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import, print_function\n\nimport os\nimport tvm\nfrom tvm import te\nimport vta\nimport numpy as np\nfrom tvm import rpc\nfrom tvm.contrib import utils\nfrom vta.testing import simulator\n\n# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file\nenv = vta.get_env()\n\n# We read the Pynq RPC host IP address and port number from the OS environment\nhost = os.environ.get(\"VTA_RPC_HOST\", \"192.168.2.99\")\nport = int(os.environ.get(\"VTA_RPC_PORT\", \"9091\"))\n\n# We configure both the bitstream and the runtime system on the Pynq\n# to match the VTA configuration specified by the vta_config.json file.\nif env.TARGET == \"pynq\" or env.TARGET == \"de10nano\":\n\n    # Make sure that TVM was compiled with RPC=1\n    assert tvm.runtime.enabled(\"rpc\")\n    remote = rpc.connect(host, port)\n\n    # Reconfigure the JIT runtime\n    vta.reconfig_runtime(remote)\n\n    # Program the FPGA with a pre-compiled VTA bitstream.\n    # You can program the FPGA with your own custom bitstream\n    # by passing the path to the bitstream file instead of None.\n    vta.program_fpga(remote, bitstream=None)\n\n# In simulation mode, host the RPC server locally.\nelif env.TARGET in [\"sim\", \"tsim\"]:\n    remote = rpc.LocalSession()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Computation Declaration\n-----------------------\nIn this example we describe a simple matrix multiplication addition, which\nrequires multiple computation stages, as shown in the dataflow diagram below.\nFirst we describe the input tensors :code:`A` and :code:`B` that are living\nin main memory.\nSecond, we need to declare intermediate tensors :code:`A_buf` and\n:code:`B_buf`, which will live in VTA's on-chip buffers.\nHaving this extra computational stage allows us to explicitly\nstage cached reads and writes.\nThird, we describe the matrix multiplication computation over\n:code:`A_buf` and :code:`B_buf` to produce the product matrix :code:`C_buf`.\nThe last operation is a cast and copy back to DRAM, into results tensor\n:code:`C`.\n\n![](https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/gemm_dataflow.png)\n\n     :align: center\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data Layout\n~~~~~~~~~~~\nWe describe the placeholder tensors :code:`A`, and :code:`B` in a tiled data\nformat to match the data layout requirements imposed by the VTA tensor core.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>**Data Tiling**\n\n  One source of complexity when targeting accelerators is to make sure\n  that the data layout matches the layout imposed by the accelerator design.\n  VTA is designed around a *tensor core* that performs, one matrix-matrix\n  operation per cycle between an activation matrix and a weight matrix,\n  adding the result matrix to an accumulator matrix, as shown in the\n  figure below.\n\n  .. image:: https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/tensor_core.png\n       :align: center\n       :width: 480px\n\n  The dimensions of that matrix-matrix multiplication are specified in\n  the :code:`vta_config.json` configuration file.\n  The activation matrix has a :code:`(BATCH, BLOCK_IN)` shape\n  and the transposed weight matrix has a :code:`(BLOCK_OUT, BLOCK_IN)` shape,\n  thus inferring that the resulting output matrix has a\n  :code:`(BATCH, BLOCK_OUT)` shape.\n  Consequently input and output tensors processed by VTA need to be\n  tiled according to these aforementioned dimension.\n\n  The diagram below shows the impact of data tiling on a matrix that is\n  originally of shape (4, 8).\n  Tiling by a (2, 2) tile shape ensures that data within each tile is\n  contiguous.\n  The resulting tiled tensor has a shape of (2, 4, 2, 2).\n\n  .. image:: https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/data_tiling.png\n       :align: center\n       :width: 480px</p></div>\n\nWe first define the variables :code:`m`, :code:`n`, :code:`o` to represent\nthe shape of the matrix multiplication. These variables are multiplicative\nfactors over the :code:`BLOCK_OUT`, :code:`BLOCK_IN`, and :code:`BATCH`\ntensor dimensions respectively.\nBy default, the configuration file sets :code:`BATCH`, :code:`BLOCK_IN`, and\n:code:`BLOCK_OUT` to be 1, 16 and 16 respectively (:code:`BATCH` being set to\n1 implies that our compute building block is vector-matrix multiply).\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>**Data Types**\n\n  It's important to not only match the inner-tile\n  dimension of VTA's tensor core, but also to match the specific data types\n  expected by VTA.\n  VTA for now only supports fixed point data types, which integer width is\n  specified in the :code:`vta_config.json` file by :code:`INP_WIDTH` and\n  :code:`WGT_WIDTH` for the activations and weights data types respectively.\n  In addition, the accumulator data type integer width is specified by\n  :code:`ACC_WIDTH`.</p></div>\n\nBy default, the configuration file sets :code:`INP_WIDTH`\nand :code:`WGT_WIDTH` to 8.\nThe accumulator width :code:`ACC_WIDTH` is set to 32, in order to avoid\noverflow during accumulation.\nAs a result, :code:`env.inp_dtype` and :code:`env.wgt_dtype` are all\nnarrow 8-bit integers, while :code:`env.acc_dtype` is a standard 32-bit\ninteger.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Output channel factor m - total 16x16=256 output channels\nm = 16\n# Input channel factor n - total 16x16=256 input channels\nn = 16\n# Batch factor o (we use single batch inference)\no = 1\n# A placeholder tensor in tiled data format\nA = te.placeholder((o, n, env.BATCH, env.BLOCK_IN), name=\"A\", dtype=env.inp_dtype)\n# B placeholder tensor in tiled data format\nB = te.placeholder((m, n, env.BLOCK_OUT, env.BLOCK_IN), name=\"B\", dtype=env.wgt_dtype)\n# A copy buffer\nA_buf = te.compute((o, n, env.BATCH, env.BLOCK_IN), lambda *i: A(*i), \"A_buf\")\n# B copy buffer\nB_buf = te.compute((m, n, env.BLOCK_OUT, env.BLOCK_IN), lambda *i: B(*i), \"B_buf\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Matrix Multiplication\n~~~~~~~~~~~~~~~~~~~~~\nNow we're ready to describe the matrix multiplication result tensor :code:`C`,\nwith another compute operation.\nThe compute function takes the shape of the tensor, as well as a lambda\nfunction that describes the computation rule for each position of the tensor.\n\nIn order to implement matrix multiplication, the lambda function needs to\ninclude a reduction formula over the input channel dimension axes.\nTo create a reduction formula, we can declare a reduction axis using\n:code:`te.reduce_axis`, which takes in the range of reductions.\n:code:`te.sum` takes in the expression to be reduced as well as\nthe reduction axes to compute the sum of value over all k in the declared\nranges.\n\nNote that the reduction needs to be performed over 32-bit :code:`env.acc_dtype`\naccumulator data types.\n\nNo computation happens during this phase, as we are only declaring how\nthe computation should be done.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Outer input feature reduction axis\nko = te.reduce_axis((0, n), name=\"ko\")\n# Inner input feature reduction axis\nki = te.reduce_axis((0, env.BLOCK_IN), name=\"ki\")\n# Describe the in-VTA matrix multiplication\nC_buf = te.compute(\n    (o, m, env.BATCH, env.BLOCK_OUT),\n    lambda bo, co, bi, ci: te.sum(\n        A_buf[bo, ko, bi, ki].astype(env.acc_dtype) * B_buf[co, ko, ci, ki].astype(env.acc_dtype),\n        axis=[ko, ki],\n    ),\n    name=\"C_buf\",\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Casting the Results\n~~~~~~~~~~~~~~~~~~~\nAfter the computation is done, we'll need to send the results computed by VTA\nback to main memory.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>**Memory Store Restrictions**\n\n  One specificity of VTA is that it only supports DRAM stores in the narrow\n  :code:`env.inp_dtype` data type format.\n  This lets us reduce the data footprint for memory transfers, but also lets\n  us quantize the wide accumulator data type down to a data format that\n  matches the input activation data type.\n  This means that in the context of neural network inference, the outputs\n  of a given layer after activation can be consumed directly by the next\n  layer.</p></div>\n\nWe perform one last typecast operation to the narrow\ninput activation data format.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Cast to output type, and send to main memory\nC = te.compute(\n    (o, m, env.BATCH, env.BLOCK_OUT), lambda *i: C_buf(*i).astype(env.inp_dtype), name=\"C\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This concludes the computation declaration part of this tutorial.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Scheduling the Computation\n--------------------------\nWhile the above lines describes the computation rule, we can obtain\n:code:`C` in many ways.\nTVM asks the user to provide an implementation of the computation called\n*schedule*.\n\nA schedule is a set of transformations to an original computation that\ntransforms the implementation of the computation without affecting\ncorrectness.\nThis simple VTA programming tutorial aims to demonstrate basic schedule\ntransformations that will map the original schedule down to VTA hardware\nprimitives.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Default Schedule\n~~~~~~~~~~~~~~~~\nAfter we construct the schedule, by default the schedule computes\n:code:`C` in the following way:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Let's take a look at the generated schedule\ns = te.create_schedule(C.op)\nprint(tvm.lower(s, [A, B, C], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Although this schedule makes sense, it won't compile to VTA.\nIn order to obtain correct code generation, we need to apply scheduling\nprimitives and code annotation that will transform the schedule into\none that can be directly lowered onto VTA hardware intrinsics.\nThose include:\n\n - DMA copy operations which will take globally-scoped tensors and copy\n   those into locally-scoped tensors.\n - Tensor operations that will perform the matrix multiplication.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Buffer Scopes\n~~~~~~~~~~~~~\nFirst, we set the scope of the buffers to tell TVM that these buffers\nwill be living in the VTA's on-chip SRAM caches.\nBelow, we tell TVM that :code:`A_buf`, :code:`B_buf`, :code:`C_buf`\nwill respectively live in VTA's on-chip input, weight and accumulator\nmemory.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>**VTA's On-Chip SRAMs**\n\n  VTA has three different memory scopes, each corresponding to different\n  on-chip SRAM buffers.\n\n   - :code:`env.inp_scope`: Input buffer, which is a read-only SRAM buffer\n     that stores input matrices of shape :code:`(env.BATCH, env.BLOCK_IN)`\n     of type :code:`env.inp_dtype`. The input buffer contains\n     `2 ^ LOG_INP_BUFF_SIZE` matrix elements (as specified in the\n     :code:`vta_config.json` file).\n   - :code:`env.wgt_scope`: Weight buffer, which is a read-only SRAM buffer\n     that stores weight matrices of shape :code:`(env.BLOCK_OUT, env.BLOCK_IN)`\n     of type :code:`env.wgt_dtype`. The weight buffer contains\n     `2 ^ LOG_WGT_BUFF_SIZE` matrix elements.\n   - :code:`env.acc_scope`: Accumulator buffer, which is a read/write SRAM\n     buffer that stores accumulator matrices of shape\n     :code:`(env.BATCH, env.BLOCK_OUT)` of type :code:`env.acc_dtype`.\n     The accumulator buffer is VTA's general purpose register file: it holds\n     both intermediate results of convolutions and matrix multiplications\n     as well as intermediate results of pooling, batch normalization, and\n     activation layers. The accumulator buffer contains\n     `2 ^ LOG_ACC_BUFF_SIZE` matrix elements.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Set the intermediate tensor's scope to VTA's on-chip buffers\ns[A_buf].set_scope(env.inp_scope)\ns[B_buf].set_scope(env.wgt_scope)\ns[C_buf].set_scope(env.acc_scope)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "DMA Transfers\n~~~~~~~~~~~~~\nWe need to schedule DMA transfers to move data living in DRAM to\nand from the VTA on-chip buffers.\nThis can be achieved using the :code:`compute_at` schedule primitive\nwhich nests the copying of the buffers into the computation loop\nthat performs the matrix multiplication.\n\nWe insert :code:`dma_copy` pragmas to indicate to the compiler\nthat the copy operations will be performed in bulk via DMA,\nwhich is common in hardware accelerators.\nFinally, we print the temporary schedule to observe the effects of\nmoving the copy operations into the matrix multiplication loop.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Move buffer copy into matrix multiply loop\ns[A_buf].compute_at(s[C_buf], ko)\ns[B_buf].compute_at(s[C_buf], ko)\n\n# Tag the buffer copies with the DMA pragma to insert a DMA transfer\ns[A_buf].pragma(s[A_buf].op.axis[0], env.dma_copy)\ns[B_buf].pragma(s[B_buf].op.axis[0], env.dma_copy)\ns[C].pragma(s[C].op.axis[0], env.dma_copy)\n\n# Let's take a look at the transformed schedule\nprint(tvm.lower(s, [A, B, C], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Tensorization\n~~~~~~~~~~~~~\nThe last step of the schedule transformation consists in applying\n*tensorization* to our schedule.\nTensorization is analogous to vectorization, but extends the concept\nto a higher-dimensional unit of computation.\nConsequently, tensorization imposes data layout constraints as discussed\nearlier when declaring the data layout input placeholders.\nWe've already arranged our tensors in a tiled format, so the next thing\nwe need to perform is loop reordering to accommodate for tensorization.\n\nHere we choose to move the outermost reduction axis all the way out.\nThis dictates that we first iterate over input channels, then batch\ndimensions, and finally output channels.\nLastly, we apply the tensorization scheduling primitive :code:`tensorize`\nalong the outer axis of the inner-most matrix matrix multiplication tensor\nblock.\nWe print the finalized schedule that is ready for code-generation\nby the VTA runtime JIT compiler.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "s[C_buf].reorder(\n    ko, s[C_buf].op.axis[0], s[C_buf].op.axis[1], s[C_buf].op.axis[2], s[C_buf].op.axis[3], ki\n)\ns[C_buf].tensorize(s[C_buf].op.axis[2], env.gemm)\n\n# Let's take a look at the finalized schedule\nprint(vta.lower(s, [A, B, C], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This concludes the scheduling portion of this tutorial.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "TVM Compilation\n---------------\nAfter we have finished specifying the schedule, we can compile it\ninto a TVM function.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Build GEMM VTA kernel\nmy_gemm = vta.build(s, [A, B, C], \"ext_dev\", env.target_host, name=\"my_gemm\")\n\n# Write the compiled module into an object file.\ntemp = utils.tempdir()\nmy_gemm.save(temp.relpath(\"gemm.o\"))\n\n# Send the executable over RPC\nremote.upload(temp.relpath(\"gemm.o\"))\n\n# Load the compiled module\nf = remote.load_module(\"gemm.o\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Running the Function\n--------------------\nThe compiled TVM function uses a concise C API and can be invoked from\ncode language.\n\nTVM provides an array API in python to aid quick testing and prototyping.\nThe array API is based on `DLPack <https://github.com/dmlc/dlpack>`_ standard.\n\n- We first create a remote context (for remote execution on the Pynq).\n- Then :code:`tvm.nd.array` formats the data accordingly.\n- :code:`f()` runs the actual computation.\n- :code:`numpy()` copies the result array back in a format that can be\n  interpreted.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Get the remote device context\nctx = remote.ext_dev(0)\n\n# Initialize the A and B arrays randomly in the int range of (-128, 128]\nA_orig = np.random.randint(-128, 128, size=(o * env.BATCH, n * env.BLOCK_IN)).astype(A.dtype)\nB_orig = np.random.randint(-128, 128, size=(m * env.BLOCK_OUT, n * env.BLOCK_IN)).astype(B.dtype)\n\n# Apply packing to the A and B arrays from a 2D to a 4D packed layout\nA_packed = A_orig.reshape(o, env.BATCH, n, env.BLOCK_IN).transpose((0, 2, 1, 3))\nB_packed = B_orig.reshape(m, env.BLOCK_OUT, n, env.BLOCK_IN).transpose((0, 2, 1, 3))\n\n# Format the input/output arrays with tvm.nd.array to the DLPack standard\nA_nd = tvm.nd.array(A_packed, ctx)\nB_nd = tvm.nd.array(B_packed, ctx)\nC_nd = tvm.nd.array(np.zeros((o, m, env.BATCH, env.BLOCK_OUT)).astype(C.dtype), ctx)\n\n# Clear stats\nif env.TARGET in [\"sim\", \"tsim\"]:\n    simulator.clear_stats()\n\n# Invoke the module to perform the computation\nf(A_nd, B_nd, C_nd)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Verifying Correctness\n---------------------\nCompute the reference result with numpy and assert that the output of the\nmatrix multiplication indeed is correct\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Compute reference result with numpy\nC_ref = np.dot(A_orig.astype(env.acc_dtype), B_orig.T.astype(env.acc_dtype)).astype(C.dtype)\nC_ref = C_ref.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))\nnp.testing.assert_equal(C_ref, C_nd.numpy())\n\n# Print stats\nif env.TARGET in [\"sim\", \"tsim\"]:\n    sim_stats = simulator.stats()\n    print(\"Execution statistics:\")\n    for k, v in sim_stats.items():\n        print(\"\\t{:<16}: {:>16}\".format(k, v))\n\nprint(\"Successful matrix multiply test!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Summary\n-------\nThis tutorial showcases the TVM workflow to implement a simple matrix\nmultiplication example on VTA.\nThe general workflow includes:\n\n- Programming the FPGA with the VTA bitstream over RPC.\n- Describing matrix multiplication via a series of computations.\n- Describing how we want to perform the computation using schedule primitives.\n- Compiling the function to the VTA target.\n- Running the compiled module and verifying it against a numpy implementation.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}